The increasing demand for energy had a major negative impact on the environment through increasing air pollutant emissions. The term air pollutant covers all substances which may harm living beings. Kinga Skalska (2010) points out the two main air pollutants sources: combustion processes of fossils fuels used in power plants and vehicles. Among the combustion-generated air contaminants there are the nitrogen oxides (NOx), considered the primary pollutants of the atmosphere, since they are responsible both for environmental problems like photochemical smog, acid rain, tropospheric ozone, ozone layer depletion and even global warming, and for many health problems in humans exposed to high concentrations of these gases. The abbreviation NOx usually relates to nitrogen monoxide NO and nitrogen dioxide NO2. Thanks to the increased environmental awareness both in public opinion and political circles, there is a special concern in reducing the emissions from power plants. In order to tackle the problem these emissions are limited by restrictive environmental rules adopted in different parts of the world, such as the directives adopted in Europe which limit the emissions to 25 ppmmdv (parts per million by dry volume). Monitoring the emissions during the combustion operations in a power plant, becomes a crucial operation. Three are the main adopted monitoring strategies:
periodic measurements: performed with calibrated equipment by testing laboratories;
continuous emission monitoring system (CEMS): through proper calibrated sensors it provides real-time information on emissions, sensors must be subjected to periodic maintenance according standard procedures;
predictive emission monitoring system (PEMS): predictive models which take in input some process variables, and, trained on past data, estimate the emission components.
Despite the fact PEMS is not yet regulated by law, yet it is a useful validation method for CEMS outputs: it could be used to estimate the periods of CEMS maintenance and can serve as backup of CEMS during failure/maintenance.
The aim of the work is to create a PEMS, which essentially carries out a regression task: the estimation of the relation that exists between a response variable and one or more independent explanatory variables. After a brief introduction of the work environment, there is a description of the dataset and the exploratory data analysis in order to better understand the data. A theoretical introduction about the linear regression analysis follows the previous part and then the evaluation metrics used in order to access the model performance are defined. The theoretical introduction is followed by the definition of some linear regression models, and eventually the statistical assumptions of the best model are assessed. After the limits of the best linear regression model have been investigated, a widely used decision-rule based regression algorithm, namely Random Forest, is tested to try to overcome those limits. Finally the performances of the two models are compared on previous unseen data.
The project is an R notebook completely developed in R, a language for statistical computing, through the RStudio IDE, which exploiting R markdown lets you create professional and well organized data science project reports. In a R notebook code chunks are interspersed by markdown chunks where you can explain each analysis phase, and both plots and tables can be visualized to clearly communicate the results. The R notebook is then knit to create a pdf or a html document. The set of loaded libraries provide useful functions, it is worth mentioning KableExtra used to create awesome html tables, ggplot and plotly, used to create plots, and the novel machine learning library mlr3 concerning the Random Forest.
# SOME USEFUL LIBRARIES=========================================================
library(leaps)
library(magrittr) # pipe
library(forestmangr) # to round values in a df
library(car)
library(data.table)
library(MASS)
# TABLES AND PLOTS==============================================================
library(grid)
library(scales)
library(tidyverse)
library(kableExtra) # custom tables
library(ggpubr) # ggrrange
library(plotly) # 3d plot
library(ggcorrplot) # correlation heatmap
# RANDOM FOREST=================================================================
library(mlr3)
library(mlr3viz)
library(mlr3learners)
library(paradox)
library(mlr3tuning)
fill_color <- "#4a6ad4"# DATA LOADING==================================================================
path <- "/home/alessandro/Documents/DataScienceEng/1_2/Mathematics_in_Machine_learning/Mathematics-for-Machine-Learning-final-project/pp_gas_emission/"
# dataframe initilization
emissions_raw <- data.frame()
for (year in 2011:2015){
# create the complete file path
file_path <- paste(path, "gt_", year, ".csv", sep = "")
# retrieve current year data
year_data <- read.csv(file_path, header = TRUE)
# add 'year' attribute
year_data$year <- rep(year, length(year_data[, 1]))
emissions_raw <- rbind(emissions_raw, year_data)
}
# drop CO column
emissions <- emissions_raw %>% select(-CO)
# convert year in a factor
emissions$year <- as.factor(emissions$year)
# shuffle the dataset
set.seed(42)
n <- nrow(emissions)
rows <- sample(n)
emissions <- emissions[rows, ]
row.names(emissions) <- NULL
# check for na values-----------------------------------------------------------
num_na <- sum(is.na(emissions))
# train validation and test sets------------------------------------------------
# test indexes
test <- sample(1:n, round(0.25*n))
remaining <- setdiff(1:n, test)
# validation indexes
val <- sample(remaining, round(0.25*n))
# training indexes
train <- setdiff(remaining, val)
# ATTRIBUTES TABLE==============================================================
# table creation----------------------------------------------------------------
attributes_df <- data.frame(
Attribute = c(
"Ambient temperature",
"Ambient pressure",
"Ambient humidity",
"Air filter difference pressure",
"Gas turbine exhaust pressure",
"Turbine inlet temperature",
"Turbine after temperature",
"Turbine energy yield",
"Compressor discharge pressure",
"Nitrogen oxides",
"Data collection year"
),
Abbr = colnames(emissions),
Unit = c(
"$°C$",
"mbar",
"(%)",
"mbar",
"mbar",
"$°C$",
"$°C$",
"MWH",
"mbar",
"$mg/m^3$",
"none"
),
min = apply(emissions[c(train, val), -11], 2, min) %>% round(digits = 2) %>% append(values="2011"),
max = apply(emissions[c(train, val), -11], 2, max) %>% round(digits = 2) %>% append(values="2015")
)
rownames(attributes_df) <- NULL
# table visualization-----------------------------------------------------------
attributes_df %>%
kbl(caption = "Dataset attributes", booktabs = TRUE) %>%
kable_styling(full_width = F, position = "float_right")| Attribute | Abbr | Unit | min | max |
|---|---|---|---|---|
| Ambient temperature | AT | \(°C\) | -6.04 | 37.1 |
| Ambient pressure | AP | mbar | 986.16 | 1036.6 |
| Ambient humidity | AH | (%) | 24.09 | 100.2 |
| Air filter difference pressure | AFDP | mbar | 2.09 | 7.61 |
| Gas turbine exhaust pressure | GTEP | mbar | 17.7 | 40.72 |
| Turbine inlet temperature | TIT | \(°C\) | 1000.8 | 1100.9 |
| Turbine after temperature | TAT | \(°C\) | 511.04 | 550.61 |
| Turbine energy yield | TEY | MWH | 100.02 | 179.5 |
| Compressor discharge pressure | CDP | mbar | 9.85 | 15.16 |
| Nitrogen oxides | NOX | \(mg/m^3\) | 27.18 | 119.91 |
| Data collection year | year | none | 2011 | 2015 |
The dataset was retrieved from UCI Machine Learning Repository at this link. It contains data concerning a gas turbine located in Turkey’s north western region for the purpose of studying flue gas emissions, namely CO and NOx (NO + NO2). The dataset has 36733 instances of 11 sensor measures aggregated over one hour, collected over a five years period, from 01/01/2011 to 31/12/2015, and sorted by chronological order. The dataset does not contain any missing value. For work-organization and clarity sake only the NOx response variable is studied, while CO is dropped from the dataset. There is no time reference among the available attributes, the year attribute has been added only to show how the empirical density of the response variable slightly changes over the years, it is not used during the analysis. Due to this fact the dataset splitting approach adopted by Heysem KAYA (2019) (2011-2012 train, 2013 validation and 2014-2015 test) was abandoned in favor of a shuffling strategy and a subsequent partition using 50% as train, 25% as validation, and the remaining 25% as test. The test set has been created and used only for the test purpose during the analysis, avoiding any possible insight about its characteristics. Table 3.1 summarizes the attribute names and ranges, while their meaning is clarified in the next section. The dataset visualization follows below.
knitr::include_graphics(here::here("/home/alessandro/Documents/DataScienceEng/1_2/Mathematics_in_Machine_learning/Mathematics-for-Machine-Learning-final-project/gas_turbine.png"))Figure 3.1: Gas turbine main components and measurement positions.
The gas turbine in figure 3.1 is the prime mover for an electricity generator. It is able to transform the chemical energy in the fuel (natural gas or similar fuel), into mechanical energy, which is transferred from the turbine exit shaft to the generator’s shaft through a gearbox. Air is let into the turbine through the air intake, going across the compressor. This component is essential to guarantee a self-sustaining combustion process. At first an initial momentum is imparted to the turbine rotor from an external engine, until the compressor reaches a certain speed. The air is incrementally compressed as it passes through each compressor stage increasing both its pressure and temperature and then reaches the combustion chamber, where it is mixed with a proper amount of natural gas. The Air/Gas ratio depends on a series of factors: air quality, heating value of the gas, humidity and so on. An ignition system steps in providing heat, and will be put out of service as soon as the fire is established and stabilized in the combustion chamber. The high pressure exhaust gas produced in the combustion chamber is applied to the turbine blades inducing the rotation of the turbine shaft, and then is conducted to the exhaust stack. The rotation of the shaft drives the compressor to draw in and compress more air to sustain continuous combustion.
The data has been already split into train, val and test and EDA is performed only to the train an val part merged together. The data exploratory phase starts with a look at the output variable empirical density in figure 3.2. The histogram is an approximated representation of the real NOx probability density, it gives a rough sense of the density associated to the underlying distribution of the NOx data. The range of the values that NOx can take is divided into intervals of equals width. The height of each bar equals the empirical density associated to the corresponding interval, and it is defined as the empirical probability divided by the interval width, where the probability is computed as the number of sample observations within the interval divided by the total number of observations. It appears to be positively skewed, and this is a problem because the normality of the dependent variable is an assumption in performing linear regression. The Q-Q plot clearly highlights the normality assumption violation.
# ORIGINAL RESPONSE VARIABLE HISTOGRAM =========================================
respHist <- ggplot() +
geom_histogram(data = emissions[-test, ], aes(x = NOX, y = ..density..), fill=fill_color) +
xlab("NOx")+
ylab("emp. density")+
ggtitle("Response variable")+
theme(plot.title = element_text(hjust = 0.5))
respQQ <- ggplot(data = emissions[-test, ], aes(sample = NOX))+
geom_qq(alpha = 0.5, size = 0.8)+
stat_qq_line(color = "red", alpha = 0.7, size = 0.5, linetype = 2)+
ggtitle("Normality assessment")+
theme(plot.title = element_text(hjust = 0.5))
ggarrange(respHist, respQQ, ncol = 2, nrow = 1)Figure 3.2: NOx empirical density function.
To tackle the problem we can try to do a box-cox transformation of the response variable.
\[\begin{equation} y(\lambda) = \begin{cases} \frac{y^\lambda-1}{\lambda}\, \text{if}\,\lambda\neq0\\ log(y)\, \text{if}\,\lambda=0 \end{cases} \tag{3.1} \end{equation}\]
This is a family of exponential transformation which relies on a parameter called \(\lambda\). With the powerTransform() function in the car package we can generate a maximum-likelihood estimation of the power \(\lambda\) most likely to normalize the transformed response. The actual transformation is then applied through the bcPower() function.
# response transformation ------------------------------------------------------
transform <- powerTransform(emissions[-test, ]$NOX)
summary(transform)## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## emissions[-test, ]$NOX -0.6861 -0.69 -0.7408 -0.6315
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 618.412 1 < 2.22e-16
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 3855.739 1 < 2.22e-16
lambda = transform$lambda
# add the transformed response as a new column ---------------------------------
emissions$bcNOX <- bcPower(emissions$NOX, lambda=lambda)Applying the function to the NOx response we obtain that the suggested power is -0.6861412, and that we can reject both the log-transformation and no-transformation hypothesis. Applying the transformation the situation gets better, even though we can see from the Q-Q plot that there are still some problems in the tails. From now on the transformed response variable will be called bcNOx.
# bcNOX empirical density ------------------------------------------------------
transHist <- ggplot() +
geom_histogram(data = emissions[-test, ], aes(x = bcNOX, y = ..density..), fill=fill_color)+
xlab("NOx transformed")+
ylab("emp. density")+
ggtitle("Transformed response variable")+
theme(plot.title = element_text(hjust = 0.5))
# Q-Q plot ---------------------------------------------------------------------
transQQ <- ggplot(data = emissions[-test, ], aes(sample = bcNOX))+
geom_qq(alpha = 0.5, size = 0.8)+
stat_qq_line(color = "red", alpha = 0.7, size = 0.5, linetype = 2)+
ggtitle("Normality assessment")+
theme(plot.title = element_text(hjust = 0.5))
ggarrange(transHist, transQQ, ncol = 2, nrow = 1)Figure 3.3: bcNOx empirical density.
In order to motivate the shuffling choice instead of creating splits according the years, figure 3.4 shows how the bcNOx empirical density slightly changes depending on the year. Generally speaking for a good model performance assessment the data used to validate and then to test the model should be drawn from the same distribution the training data was drawn from, assuming they have similar shapes, otherwise the whole procedure loses its meaning.
# Yearly bcNOX empirical density -----------------------------------------------
ggplot() +
geom_histogram(data = emissions[-test, ], aes(x = bcNOX, y=..density..), fill=fill_color)+
scale_x_continuous(breaks= pretty_breaks())+
facet_wrap(~year, nrow = 1, ncol = 5)+
theme(panel.spacing = unit(0.8, "lines"))+
xlab("bcNOx")+
ylab("emp. density")Figure 3.4: Yearly bcNOx empirical density.
To assess the existence of collinearity among the features the Pearson’s correlation coefficient can be computed. It is also known as bivariate correlation and its statistical meaning is measuring the linear correlation between two random variables X and Y. It lies within a range of -1 to +1 where +1 means total positive linear correlation, -1 means total negative linear correlation and 0 means no linear correlation. The coefficient is defined as follow:
\[\begin{equation} \rho_{X,Y}=\frac{Cov(X,Y)}{\sigma_X\sigma_Y}=\frac{E\big[(X-\mu_X)(Y-\mu_Y)\big]}{\sqrt{E\big[(Y-\mu_Y)^2\big]}\sqrt{E\big[(X-\mu_X)^2\big]}} \end{equation}\]
where \(Cov\) is the covariance, \(\sigma_X\) is the standard deviation of \(X\), and \(\mu_X\) is the mean of \(X\) (analogously for \(\sigma_Y\) and \(\mu_Y\)). Having a sample \(\{x_i, y_i\}_{i=1}^n\) it can be estimated as:
\[\begin{equation} r_{x,y}=\frac{\sum_{i=1}^{n}{(x_i-\overline{x})(y_i-\overline{y})}}{\sqrt{\sum_{i=1}^{n}{(x_i-\overline{x})^2}}\sqrt{\sum_{i=1}^{n}{(y_i-\overline{y})^2}}} \end{equation}\]
where \(n\) is the sample size, \(x_i\) is the i-th sample element and \(\overline{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}\) is the sample mean (analogously for \(y_i\) and \(\overline{y}\)).
Identifying closely related predictor variables is an important step in a regression context, in fact problems could arise since it is difficult to separate out the individual effect on the response of this kind of variables. The easiest solution adopted in this work is to select just some of the available predictors, as we will seen in the features selection section. Two are the main limits of the Pearson’s correlation coefficient:
it is bivariate, still it could also happen that a high linear correlation exists between a predictor and some of the others, namely multi-collinearity;
it only assess for linear relationship.
# Correlation Heat-Map ---------------------------------------------------------
year_col = 11 # drop year column
nox_col = 10 # drop NOX column
corr <- cor(emissions[-test, -c(year_col, nox_col)], emissions[-test, -c(year_col, nox_col)])
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE)+
ggtitle("correlation heatmap")+
theme(plot.title = element_text(hjust = 0.5))Figure 3.5: bivariate correlation computed among features.
Looking at the correlation heatmap of figure 3.5 there exist some highly linearly correlated input variables, also notice how the higher correlations occur between the turbine measure variables. The highest level of dependence exist between Compressor Discharge Pressure (CDP) and Turbine Energy Yield (TEY) (0.99), between Compressor Discharge Pressure (CDP) and Gas Turbine Exhaust Pressure (GTEP) (0.98) and also between Gas Turbine Exhaust Pressure (GTEP) and Turbine Energy Yield (TEY) (0.96). This means that some of the features may contain redundant information, and thus can be eliminated during the model training. Also notice the quite high correlation between the response variable bcNOx and the ambient temperature (AT) (-0.56): the negative sign means that the response variable tend to decrease on average when the ambient temperature increases.
The first analyzed model is the linear regression, which falls under the General Linear Model (GLM) broader framework. Some theoretical concepts concerning GLM will be now introduced. Let’s start saying that the main idea is to use a set of predictors collected in a matrix \(X\) in order to explain the probabilistic behavior of a set of quantitative responses \(y_1,\dots,y_n\) which are considered as realization of normal distributed random variables \(Y_1,\dots,Y_n\). What we want to find is the best linear approximation of the true unknown relationship between \(\boldsymbol{Y}\) and \(\boldsymbol{X}\), in other words:
\[\begin{equation} \boldsymbol{Y} = \boldsymbol{X\beta}+\boldsymbol{\epsilon} \end{equation}\]
more explicitly:
\[\begin{equation} \begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{bmatrix} = \begin{bmatrix} x_{11}&x_{12}&\cdots&x_{1p}\\ x_{21}&x_{22}&\cdots&x_{1p}\\ \vdots&\vdots&\ddots&\vdots\\ x_{n1}&x_{n2}&\cdots&x_{np} \end{bmatrix} \begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1} \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{bmatrix} \end{equation}\]
where:
\(\boldsymbol{Y}\) is the \(n \times 1\) response vector
\(\boldsymbol{X}\) is \(n\times p\) matrix containing the values of \(p-1\) predictors because the first column is usually a column of \(1\)’s. They are not random but can be considered as values under the control of the experimenter
\(\boldsymbol{\beta}\) is the \(p \times 1\) vector of unknown parameters object of the statistical inference
\(\boldsymbol{\epsilon}\) is a \(n\times 1\) vector of unobservable random variables which take into account the error due to different sources of uncertainty (simplicity of the model, measurement errors, natural variability, …).
The random error vector is modeled as white noise added to the signal \(\boldsymbol{X\beta}\), thus having a multivariate normal distribution:
\[\begin{equation} \boldsymbol{\epsilon} \sim \mathcal{N}_n\big(\boldsymbol{0}_{n \times 1},\; \sigma^2\boldsymbol{I}_{n \times n}\big) \end{equation}\]
meaning the error terms are independent and identically distributed (i.i.d.) random variables. The fact implies that \(Y_1, \dots, Y_n\) are normal independent distributed, but not identically because they have different means:
\[\begin{equation} \boldsymbol{Y} \sim \mathcal{N}_n\big(\boldsymbol{X\beta},\; \sigma^2\boldsymbol{I}_{n \times n}\big) \end{equation}\]
\[\begin{equation} \begin{aligned} E(Y_i) = \boldsymbol{X}_{i*}\boldsymbol{\beta}\\ Var(Y_i) = \sigma^2 \end{aligned} \end{equation}\]
where \(\boldsymbol{X}_{i*}\) is the \(i\)-th row of \(\boldsymbol{X}\).
A solution for our problem is found through the Maximum Likelihood Estimation method, which leads us to the least squares criterion: we are searching for a set of \(\boldsymbol{\beta}\) values which maximize the probability “to see what we saw” (the observed \(\mathbf{y}\)). We need to solve a maximization problem involving the likelihood function:
\[\begin{equation} \mathcal{L}\big(\boldsymbol{\beta}|\boldsymbol{Y}=\boldsymbol{y}\big)\propto exp\big(-||\boldsymbol{y}-\boldsymbol{\mu}||^2\big). \end{equation}\]
The starting maximization problem can be rewritten as a least squares minimization problem:
\[\begin{equation} \min_\boldsymbol{\beta}\;||\boldsymbol{y}-\boldsymbol{\mu}||^2=\min_\boldsymbol{\beta}\;||\boldsymbol{y}-\boldsymbol{X\beta}||^2 \end{equation}\]
letting us find the following estimator if \(\boldsymbol{\big(X^\top X\big)}\) is invertible:
\[\begin{equation} \hat{\boldsymbol{\beta}} = \big(\boldsymbol{X}^\top\boldsymbol{X}\big)^{-1}\boldsymbol{X}^\top \boldsymbol{Y}. \tag{4.1} \end{equation}\]
Being a linear transformation of \(\mathbf{Y}\) also \(\hat{\boldsymbol{\beta}}\) has a multivariate normal distribution:
\[\begin{equation} \hat{\boldsymbol{\beta}} \sim \mathcal{N}_p\Big(\boldsymbol{\beta}, \; \sigma^2\big(\boldsymbol{X}^\top\boldsymbol{X}\big)^{-1}\Big) \tag{4.2} \end{equation}\]
meaning it is an unbiased estimator, that is if we had to recompute \(\hat{\boldsymbol{\beta}}\) with different \(\mathbf{y}\) samples, on average the it does not tend to neither overestimate nor underestimate \(\boldsymbol{\beta}\) (unknown to us). Finally given the previous estimator and \(\boldsymbol{Y} = \boldsymbol{y}\) we can find the least squares coefficient estimates \(\hat{\beta}\)’s related to our problem, which enable us to make predictions. Let \(\boldsymbol{x}=\begin{bmatrix} x_1 & \cdots & x_p \end{bmatrix}^\top\) be a new set of predictor values, then:
\[\begin{equation} \hat{y} = \boldsymbol{x}^\top\hat{\boldsymbol{\beta}}. \end{equation}\]
First of all we need to introduce the vector of residuals defined as:
\[\begin{equation} \boldsymbol{e} = \boldsymbol{Y}-\boldsymbol{X}\hat{\boldsymbol{\beta}}. \end{equation}\]
The residuals are the sampling, measurable counterpart of errors. The following metrics are used in assessing the model goodness:
# mae --------------------------------------------------------------------------
mae <- function(actual, predicted){
mae <- mean(abs(actual-predicted))
return(mae)
}\[\begin{equation} MAE = \frac{|\boldsymbol{e}|\boldsymbol{1}}{n} = \frac{\sum_{i=1}^{n}{|Y_i - \hat{Y_i}|}}{n} \end{equation}\]
# rmse -------------------------------------------------------------------------
mse <- function(actual, predicted){
mse <- (actual-predicted)^2 %>% mean()
return(mse)
}
rmse <- function(actual, predicted){
rmse <- mse(predicted, actual) %>% sqrt()
return(rmse)
}\[\begin{equation} RMSE= \sqrt{\frac{||\boldsymbol{e}||^2}{n}} = \sqrt\frac{\sum_{i=1}^{n}{(Y_i - \hat{Y_i})^2}}{n} \end{equation}\]
The first and simplest among the tried models during the analysis is a simple linear regression model which consider \(bcNO_x\) as response variable and Ambient Temperature (AT) as predictor, namely:
\[\begin{equation} bcNO_x \approx \beta_0 + \beta_1\cdot AT. \end{equation}\]
It is known as population regression line and is the best linear approximation to the true relationship between \(AT\) and \(bcNO_x\). Applying the \(\hat{\boldsymbol{\beta}}\) estimator (4.1) we find the least squares line, also shown in figure 4.1:
\[\begin{equation} \hat{y} = \hat{\beta}_0+\hat{\beta}_1 x_1 \end{equation}\]
# SIPLE LINEAR REGRESSION MODEL ================================================
# model definition -------------------------------------------------------------
simple_lm <- lm(bcNOX ~ AT, data = emissions, subset = train)
# plot -------------------------------------------------------------------------
ggplot(data = emissions[train, ], aes(x = AT, y = bcNOX))+
geom_point(alpha = 0.5, color = "#121214", size = 1) +
geom_smooth(method = "lm", se=FALSE, color = fill_color, size = 0.8)+
ylab("bcNOx") +
xlab("Ambient Temperature") +
ggtitle("Least squares line")+
theme(plot.title = element_text(hjust = 0.5))Figure 4.1: simple_lm regression model.
Through the summary R command is possible to have detailed information about the model.
##
## Call:
## lm(formula = bcNOX ~ AT, data = emissions, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.062621 -0.005222 0.000185 0.005414 0.032022
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.386e+00 1.536e-04 9022.10 <2e-16 ***
## AT -7.318e-04 7.987e-06 -91.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00808 on 18365 degrees of freedom
## Multiple R-squared: 0.3137, Adjusted R-squared: 0.3137
## F-statistic: 8394 on 1 and 18365 DF, p-value: < 2.2e-16
After the regression formula we can split the information in three main parts:
The first part is information related to residuals, where we can see some statistics which give us a rough idea of the distribution of the residuals. We can see in our case that they are distributed around zero and that min and max residuals could refer to some strange observations, remember that in the Q-Q plot in figure 3.3 we observed some points in the tails quite far from the normal-assumption line.
The second part is information related to estimated coefficients, where we can see the estimated values with their standard error, and the p-values which reject the hypothesis that the related predictor has a negligible influence on the response. From (4.2) we can retrieve both the real and approximated SE of each \(\hat{\beta}_i\), where the latter is computed using the Residual Standard Error \(\hat{\sigma}\), instead of the unknown error standard deviation \(\sigma\).
\[\begin{equation} SE(\hat{\beta}_i) = \sigma\sqrt{(\boldsymbol{X}^\top \boldsymbol{X})^{-1}_{i+1,i+1}} \\ \widehat{SE}(\hat{\beta}_i) = \hat{\sigma}\sqrt{(\boldsymbol{X}^\top \boldsymbol{X})^{-1}_{i+1,i+1}} \end{equation}\]
We can also easily compute the \(95\%\) confidence interval for each \(\hat{\beta}_i\), in our case
## 2.5 % 97.5 %
## (Intercept) 1.3857553981 1.3863576529
## AT -0.0007474302 -0.0007161194
The third part is information related to the goodness of fit, where we can see:
\[\begin{equation} \hat{\sigma} = \sqrt{\frac{\boldsymbol{e}^\top\boldsymbol{e}}{n-p}} \end{equation}\]
\[\begin{equation} R^2 = \frac{SS_{reg}}{SS_{tot}}=\frac{\sum_{i=1}^{n}{(\hat{Y_i}-\overline{Y})}}{\sum_{i=1}^{n}{(Y_i-\overline{Y})}}\\ R_{adj}^2 = 1-(1-R^2)\frac{n-1}{n-p-1} \end{equation}\]
The next step is to analyze a multiple linear regression model plus investigating the effect of an interaction term. To begin with in a multiple linear regression model we interpret \(\beta_i\) to be the average effect on \(Y\) of a one unit increase in \(X_i\), holding all the other predictors fixed. We now use also Ambient Humidity (AH) as predictor obtaining:
\[\begin{equation} bcNO_x \approx \beta_0+\beta_1 \cdot AT+ \beta_2 \cdot AH. \end{equation}\]
Found the coefficient estimates we can compute the least squares plane:
\[\begin{equation} \hat{y} = \hat{\beta_0}+\hat{\beta_1} \cdot x_1+ \hat{\beta_2} \cdot x_2 \end{equation}\]
with \(\hat{y}\) prediction on the basis of \(AT = x_1\) and \(AH = x_2\), shown on the left side of figure 4.2.
# MLTIPLE LINEAR REGRESSION MODELS =============================================
# models -----------------------------------------------------------------------
# multiple linear regression
multiple_lm <- lm(bcNOX ~ AT+AH, data = emissions, subset = train)
# multiple linear regression with interactions
interaction_lm <- lm(bcNOX ~ AT*AH, data = emissions, subset = train)
# interaction effect plots -----------------------------------------------------
at <- pretty(0:35, n=10)
ah <- pretty(27:101, n=10)
grid <- expand.grid(ah, at)
d <- setNames(data.frame(grid), c("AH", "AT"))
# define the regression plane
vals1 <- predict(multiple_lm, newdata = d)
plane1 <- t(matrix(vals1, nrow = length(ah), ncol = length(at)))
vals2 <-predict(interaction_lm, newdata = d)
plane2 <- t(matrix(vals2, nrow = length(ah), ncol = length(at)))
# define colors
marker_col <- "#121214"
high_col <- fill_color
low_col <- "#d7d7de"
# create the figures
fig1 <- plot_ly(
emissions[train, ],
x = ~AH,
y = ~AT,
z = ~bcNOX,
size=1,
alpha=0.5,
scene = "scene1",
marker = list(
symbol = 'circle',
sizemode = 'diameter'),
sizes = c(3)
)
fig1 <- fig1 %>%
add_markers(color = I(marker_col))
fig1 <- fig1 %>%
add_surface(z = ~plane1, x = ~ah, y = ~at, showscale = FALSE, colorscale = list(c(0, low_col), c(1, high_col)))
fig2 <- plot_ly(
emissions[train, ],
x = ~AH,
y = ~AT,
z = ~bcNOX,
size=1,
alpha=0.5,
scene = "scene2",
marker = list(
symbol = 'circle',
sizemode = 'diameter'),
sizes = c(3)
)
fig2 <- fig2 %>%
add_markers(color = I(marker_col))
fig2 <- fig2 %>%
add_surface(z = ~plane2, x = ~ah, y = ~at, showscale = FALSE, colorscale = list(c(0, low_col), c(1, high_col)))
# create the subplot
fig <- subplot(fig1, fig2)
fig <- fig %>%
layout(
scene = list(
domain = list(x = c(0, 0.49), y = c(0, 1)),
xaxis = list(title = 'AH'),
yaxis = list(title = 'AT'),
zaxis = list(title = 'bcNOx')
),
scene2 = list(
domain = list(x = c(0.51, 1), y = c(0, 1)),
xaxis = list(title = 'AH'),
yaxis = list(title = 'AT'),
zaxis = list(title = 'bcNOx')
),
annotations = list(
list(x=0.15, y = 1, text="least squares plane\n(no interaction)", showarrow = F, xref="paper", yref="paper"),
list(x=0.85, y = 1, text="least squares surface\n(interaction effect)", showarrow = F, xref="paper", yref="paper")
),
showlegend = FALSE)
figFigure 4.2: interaction effect in multiple linear regression.
Interacting with the plot is quite clear that some regions of the least squares plane tend to overestimate (when AT and AH values are both low or high) or underestimate (when one of them have high values and the other low) the fitted values, this means that there is probably an interaction effect between \(AH\) and \(AT\) that we are missing, the introduction of this term may improve the model, which becomes:
\[\begin{equation} \begin{aligned} bcNO_x \approx \beta_0+\beta_1\cdot AT + \beta_2 \cdot AH + \beta_3 \cdot(AH \cdot AT)\\ = \beta_0 + (\beta_1+\beta_3\cdot AH)\cdot AT + \beta_2 \cdot AH\\ = \beta_0 + \tilde{\beta}_1 \cdot AT + \beta_2 \cdot AH \end{aligned} \tag{4.3} \end{equation}\]
meaning that the effect of \(AT\) on \(bcNO_x\) is no longer constant. \(\beta_3\) can be considered as the increase in the \(AT\) contribution for a unit increase in \(AH\). Right side of figure 4.2 shows how the least squares surface now better fit the data. After the introduction of the interaction effect, the \(R^2\) value increases from 0.33 to 0.37.
The first operation of this step is to identify if there is multi-collinearity , which is a problem in terms of model interpretation. Multi-collinearity means there are some highly correlated independent variables: we cannot estimate the relationship between each independent variable and the response variable if independent variables tend to change in unison. Multi-collinearity makes coefficients estimates very sensitive to small changes in the model and reduces the precision of coefficients estimates, which weakens the statistical power of the model. The Variance Inflation Factor (VIF) is a ratio which helps in identifying multi-collinearity problems. The VIF related to the \(j^{th}\) coefficient estimate \(\hat{\beta}_j\) measures how much the variance of \(\hat{\beta}_j\) is inflated by the collinearity and is defined as
\[\begin{equation} VIF_j = \frac{1}{1-R^2_{X_j|X_{-j}}} \end{equation}\]
where \(R^2_{X_j|X_{-j}}\) is the \(R^2\) when \(X_j\) is regressed on all the remaining predictors. When the other predictors well explain the behavior of \(X_j\) the coefficient of determination will be near 1, as a result the VIF will have an high value, on the other side if the other predictors poorly explain \(X_j\), the coefficient of determination will near 0, as a result VIF will be near 1. As a rule of thumb, a squared VIF value that exceeds 2 highlights a problematic amount of multi-collinearity. The adopted approach is to recompute the VIFs fpr each predictor starting from the complete model, dropping one at a time the predictor with the highest associated VIF value, until all VIF values are ok. In our case we need to eliminate Compressor Discharge Pressure (CDP), Turbine Energy Yield (TEY) and Gas Turbine Exhaust Pressure (GTEP). The following are the squared VIF values before and after the predictors elimination.
# MULTICOLLINEARITY MEASURE ====================================================
# vif computation --------------------------------------------------------------
big_model <- lm(bcNOX ~ .-NOX-year, data=emissions, subset=train)
before <- vif(big_model) %>% sqrt()
#vif(big_model)
#sqrt(vif(big_model))>2
big_model <- lm(bcNOX ~ .-NOX-year-TEY, data=emissions, subset=train)
#vif(big_model)
#sqrt(vif(big_model))>2
big_model <- lm(bcNOX ~ .-NOX-year-TEY-CDP, data=emissions, subset=train)
#vif(big_model)
#sqrt(vif(big_model))>2
big_model <- lm(bcNOX ~ .-NOX-year-TEY-CDP-GTEP, data=emissions, subset=train)
after <- vif(big_model) %>% sqrt()
after <- c(after[1:4], NaN, after[5:6], NaN, NaN)
#vif(big_model)
#sqrt(vif(big_model))>2
# vif results visualization-----------------------------------------------------
vif_df <- data.frame("before" = before %>% round(digits=2), "after"= after %>% round(digits=2))
as.matrix(vif_df)## before after
## AT 2.87 1.47
## AP 1.23 1.14
## AH 1.24 1.22
## AFDP 1.58 1.57
## GTEP 5.50 NaN
## TIT 6.22 1.41
## TAT 3.20 1.32
## TEY 15.92 NaN
## CDP 15.72 NaN
Now that we selected 6 predictors we can study if there exist just a subset of them that is able to fit the data as well as all of them do. Since there is a small number of predictors we can use the all subsets regression approach, through the regsubsets function in the leaps library. We ask for the 4 best models of each predictors subset size through the parameter nbest, and specify the coefficient of determination \(R^2\) as criterion for reporting the best model.
leaps <- regsubsets(bcNOX ~ AP+AT+AH+AFDP+TIT+TAT, data=emissions[train, ], nbest=4, method="exhaustive")
plot(leaps, scale="r2")Figure 4.3: all subset regression for feature selection.
From the result we can see that selecting AP, AT, AH and TAT we have almost the same result as selecting all the predictors, so we can define the big linear model big_lm with these subset of predictors. We can also add the AT:AH interaction effect as we already did and define a second big model we will call big_interaction_lm. In the next section the models defined so far will be compared and their performance on the validation set will be evaluated.
A general approach to select a good model is try to find the simplest one which still can well explain the data. According this view we can compare models carrying out the Anova type test, or through the Akaike’s Information Criterion (AIC).
The Anova test let us comparing a bigger model vs a smaller nested model in order to understand if the smaller model is good enough, in the next example the simple_lm and the multiple_lm are compared.
# ANOVA ========================================================================
print(anova(simple_lm, multiple_lm))## Analysis of Variance Table
##
## Model 1: bcNOX ~ AT
## Model 2: bcNOX ~ AT + AH
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 18365 1.1989
## 2 18364 1.1683 1 0.030569 480.48 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test is an F-based test and a low p-value means we can reject the null hypothesis that the smaller model is good enough in explaining the data. Through the AIC also non-nested models can be compared: it takes into account the number of estimated parameters plus one (the variance), penalizing model with an high number of parameters. The driving idea is that having a smaller number of parameters means searching for a simpler explanation of the world, which is much preferable. This concept is encoded in AIC define as
\[\begin{equation} AIC = 2(p+1)-log(\hat{L}) \end{equation}\]
where \(\hat{L}\) is the maximized likelihood function and a low AIC value means the associated model is a good one. To select the best model we can now compare all the models defined so far, also computing the evaluation metrics for the original NOx response.
# AIC and MODEL PERFORMANCE ====================================================
# inverse box cox --------------------------------------------------------------
invBC <- function(x, lmb){
y <- (x*lmb+1)^(1/lmb)
return(y)
}
# validation set predictions ---------------------------------------------------
pred_simp <- predict(simple_lm, emissions, type="response")[val] %>% invBC(lmb=lambda)
pred_multi <- predict(multiple_lm, emissions, type="response")[val] %>% invBC(lmb=lambda)
pred_inter <- predict(interaction_lm, emissions, type="response")[val] %>% invBC(lmb=lambda)
pred_big <- predict(big_lm, emissions, type="response")[val] %>% invBC(lmb=lambda)
pred_big_inter <- predict(big_interaction_lm, emissions, type="response")[val] %>% invBC(lmb=lambda)
# actual validation values -----------------------------------------------------
actual <- emissions[val, 'NOX']
# mae --------------------------------------------------------------------------
mae_simp <- mae(actual, pred_simp) %>% round(digits=3)
mae_multi <- mae(actual, pred_multi) %>% round(digits=3)
mae_inter <- mae(actual, pred_inter) %>% round(digits=3)
mae_big <- mae(actual, pred_big) %>% round(digits=3)
mae_big_inter <- mae(actual, pred_big_inter) %>% round(digits=3)
# rmse -------------------------------------------------------------------------
rmse_simp <- rmse(actual, pred_simp) %>% round(digits=3)
rmse_multi <- rmse(actual, pred_multi) %>% round(digits=3)
rmse_inter <- rmse(actual, pred_inter) %>% round(digits=3)
rmse_big <- rmse(actual, pred_big) %>% round(digits=3)
rmse_big_inter <- rmse(actual, pred_big_inter) %>% round(digits=3)
# performance df ---------------------------------------------------------------
comparison_df <- data.frame(
AIC(simple_lm, multiple_lm, interaction_lm, big_lm, big_interaction_lm),
"MAE" = c(mae_simp, mae_multi, mae_inter, mae_big, mae_big_inter),
"RMSE" = c(rmse_simp, rmse_multi, rmse_inter, rmse_big, rmse_big_inter))
row.names(comparison_df) <- c("simple_lm", "multiple_lm", "interaction_lm", "big_lm", "big_interaction_lm")
# visualization ----------------------------------------------------------------
as.matrix(comparison_df)## df AIC MAE RMSE
## simple_lm 3 -124871.8 7.248 9.656
## multiple_lm 4 -125344.2 7.222 9.678
## interaction_lm 5 -126353.2 7.017 9.362
## big_lm 6 -125792.2 7.053 9.542
## big_interaction_lm 7 -126792.0 6.830 9.236
The big_interaction_model is clearly the better one among them, although having the higher number of parameters.
Standardized residuals are the residuals divided by their estimated standard deviation, they are an important tools to find possible outliers and, combined with the fitted values, a diagnostic tool to understand if the regression model is reliable or not. The plot standardized residual vs fitted values should not shows any pattern, if so it means that the residuals are not white noise but contains some unexplained information, maybe due to missing features in the collected data or a wrong predictors choice.
# stardardized resilduals df ---------------------------------------------------
diagnostic_df <- data.frame("res"= rstandard(big_interaction_lm), "fitted"= big_interaction_lm$fitted.values)
# residuals vs fitted values ---------------------------------------------------
res_plot <- ggplot(data = diagnostic_df, aes(x=fitted, y=res))+
geom_point(alpha = 0.5, size = 1, color = "#121214")+
geom_smooth(se = FALSE, color = fill_color, size=0.9)+
geom_abline(slope = 0, intercept = 0, color="red", linetype = 2, size = 0.5)+
xlab("Fitted values")+
ylab("Standardized residuals") +
ggtitle("Residuals vs Fitted") +
theme(plot.title = element_text(hjust = 0.5))
# Q-Q plot ---------------------------------------------------------------------
qq_res <- ggplot(data = diagnostic_df, aes(sample = res))+
geom_qq(alpha = 0.5, size = 1, color = "#121214")+
stat_qq_line(color = "red", alpha = 0.7, size = 0.5, linetype = 2)+
ylab("Standardized residuals") +
xlab("Theoretical Quantiles")+
ggtitle("Normal Q-Q") +
theme(plot.title = element_text(hjust = 0.5))
# visualization ----------------------------------------------------------------
ggarrange(res_plot, qq_res, ncol = 2, nrow = 1, align = "h")
Figure 4.4: diagnostic plots for big_interaction_lm model.
From the previous plot we can understand that there is the presence of some outliers, which the model tends to overestimate, and that both the statistical assumption of heteroscedasticity and normality of errors is somehow violated. In conclusion also considering that the best model big_interaction_lm does not well fit the data, in fact only 38.21% of the total variability is explained by the model, it is likely to be not a good model choice. In the next section we will try to overcome the limitation of the linear model with a non linear regression model.
Another kind of highly non-linear models, namely Random Forest, is investigated in the next sections in trying to overcome the linear regression model limits. The idea behind the random forest is to combine a large number of simpler elements, the regression trees, exploiting all of them to obtain predictions for each observation and then averaging all the predictions for the same observation to get a reliable result.
The single regression tree is a non-linear model which starting from the entire predictor space, segments it in different regions and then computes for each region the mean response for the training observations which fall in that region, this value will be the prediction made by the tree for an unseen observation which will fall in that specific region. The goal is to find a set of \(J\) regions \(R_1,\dots,R_J\), which in our case will be high-dimensional rectangles, which minimizes the \(RSS\)
\[\begin{equation} \sum_{j=1}^{J}{\sum_{i \in R_j}{(y_i-\hat{y}_{R_j})^2}} \tag{5.1} \end{equation}\]
where \(\hat{y}_{R_j}\) is the prediction associated to the jth region. The approach to define the regions is called recursive binary splitting: the idea is to start from the entire predictor space being a single region and find the predictor \(X_j\) and the cut point \(s\) in order to split the starting region in two sub-regions that results in the highest decreasing in \(RSS\). More technically for each \(j\) and \(s\) are created two regions:
\[\begin{equation} R_1(j,s) = \{X|X_j<s\} \mbox{ and } R_2(j,s) = \{X|X_j>s\} \tag{5.2} \end{equation}\]
where \(R_1(j,s)\) (\(R_2(j,s)\)) is the region of the predictor space where \(X_j < s\) (\(X_j > s\)). Then we search for the \(j\) and \(s\) values which minimize
\[\begin{equation} \sum_{i:\;x_i \in R_1(j,s)}{(y_i-\hat{y}_{R_1})^2}+\sum_{i:\;x_i \in R_2(j,s)}{(y_i-\hat{y}_{R_2})^2} \tag{5.3} \end{equation}\]
where \(\hat{y}_{R_1}\) and \(\hat{y}_{R_2}\) are the mean response of training observation which fall in region \(R_1(j,s)\) and \(R_2(j,s)\). At each step the tree-building process search for the best split without taking into account a split which could be lead to a better tree in future steps. The process continues and now one of the two regions is split obtaining three regions, and so on until a stop condition is reached (e.g a certain number of nodes in the leaves, a minimum reduction threshold in \(RSS\)). Applying a three-based method in a two dimensional space, it will search the best step function which fit the available data, such as in the next example.
# REGRESSION TREE EXAMPLE ======================================================
# parable function
parable <- function(x){
y <- x^2 - 6*x + 10
return(y)
}
# set seed to get reproducible results
set.seed(42)
# define example.data
x <- runif(150, min = 1, max = 5)
noise <- rnorm(length(x), mean = 0, sd = 0.2)
y <- parable(x) + noise
example.data <- data.frame(x = x, y = y)
# define an example task
example.task <- TaskRegr$new(id = "example", backend = example.data, target = "y")
# define the regression tree learner
rt <- lrn("regr.rpart")
rt$train(example.task)
# define example test data
x_test <- seq(1, 5, by = 0.01)
y_test <- parable(x_test)
test.data <- data.frame(x = x_test, y = y_test)
# define example test task
test.task <- TaskRegr$new(id = "example_test", backend = test.data, target = "y")
# make predictions
predictions <- rt$predict(test.task)$response
# plot example data points and predictions
ggplot() +
geom_point(aes(x, y), alpha = 0.7) +
geom_line(aes(test.data$x, predictions), color = fill_color, size = 1)Figure 5.1: regression tree fitting a noiy signal.
Pay attention that the although being an easy model to understand regression-tree is an high variance procedure: if applied repeatedly to different datasets drawn from the same data distribution, three-based methods could yield quite different results. They have a low generalization capability and tend to easily overfit the available data. By aggregating many regression-trees the predictive performances can be greatly improved.
As already said the idea which is behind the random forest is use a large number of regression tree, 500 in our case, to make prediction for the observations and then average the prediction for the same observation in order to obtain a low variance prediction. To obtain a low-variance statistical model, each tree is build over a different bootstrapped version of the original training set, and at each tree-building-process step only a subset of the \(m\) predictors is selected, this will ensure that if there exist a strong predictors it will not be taken into account at each step and the resulting trees will be decorrelated (the average of many decorrelated quantities does lead to an higher decrease in variability than the average of many correlated quantities does). The first step now is to define a random forest model to use in our case and evaluate its performances. Due to the fact that it is likely to overfit the data, to better evaluate its real performances, the model is built on the training set and then RMSE and MAE are evaluated both on training set and also on the validation set.
# task ------------------------------------------------------------------------
task <- TaskRegr$new(
id = "nox",
backend = emissions %>% select(-c(NOX, year)),
target = "bcNOX"
)
# learner ----------------------------------------------------------------------
rf_one <- lrn("regr.ranger", num.trees = 500, splitrule = "variance")
rf_one$train(task, row_ids = train)
train_pred <- rf_one$predict(task, row_ids = train)$response %>% invBC(lmb=lambda)
val_pred <- rf_one$predict(task, row_ids = val)$response %>% invBC(lmb=lambda)
# resume metrics in a table-----------------------------------------------------
# save actual values
actual_train <- emissions$NOX[train]
actual_val <- emissions$NOX[val]
# metrics matrix
mat_metrics <- matrix(nrow=2, ncol=2)
rownames(mat_metrics) <- c("MAE","RMSE")
colnames(mat_metrics) <- c("train", "val")
mat_metrics[,"train"] <- c(
mae(actual_train, train_pred),
rmse(actual_train, train_pred)
)
mat_metrics[,"val"] <- c(
mae(actual_val, val_pred),
rmse(actual_val, val_pred)
)
mat_metrics %>% round(digits=2)## train val
## MAE 1.24 2.90
## RMSE 2.01 4.49
As suspected the performances in validation are worse than the performances evaluated on the training set. To have an idea of the random forest features we can plot two histograms related the number of total nodes in each tree and the tree depth.
# recursive function to get the tree depth--------------------------------------
# returns the number of nodes of the longest paths, root included
maxDepth <- function(f_idx, lefts, rights){
if (lefts[f_idx] == 0 & rights[f_idx] == 0){
return(1)
}
left_depth <- maxDepth(lefts[f_idx]+1, lefts, rights)
right_depth <- maxDepth(rights[f_idx]+1, lefts, rights)
return(max(left_depth, right_depth)+1)
}
# function to get the depths of all the trees in the forest---------------------
# returns the tree depth for each tree in the forest
forestDepths <- function(rf){
depths <- numeric(0)
for (i in 1:rf$model$num.trees){
lefts <- rf$model$forest$child.nodeIDs[[i]][[1]]
rights <- rf$model$forest$child.nodeIDs[[i]][[2]]
depths[i] <- maxDepth(1, lefts, rights)
}
return(depths)
}
# random forest depts and number of nodes---------------------------------------
nodes <- lengths(rf_one$model$forest$split.varIDs)
depths <- forestDepths(rf_one) - 1
# resumes depths and number of nodes in a dataframe
rf_description_df <- data.frame(
nodes = nodes,
depths = depths
)
# depths and number of nodes histograms-----------------------------------------
nodes_hist <- ggplot()+
geom_histogram(aes(nodes), data = rf_description_df, bins = 15, fill=fill_color) +
xlab("number of nodes") +
ylab("trees count")
depths_hist <- ggplot()+
geom_histogram(aes(depths), data = rf_description_df, bins = 15, fill=fill_color) +
xlab("tree depth") +
ylab("trees count")
# create a plot grid
ggarrange(nodes_hist, depths_hist, ncol = 2, nrow = 1, align = "v")All the trees have a total number of nodes greater than twelve thousands and a depth which exceeds 30 levels. As depth is meant the longest path from the root to the leaves. In order to mitigate the overfitting the following hyper-parameters are tuned:
mtry: the number of \(p\) predictors selected at each step, assumes values \(\{3, 4, 5, 6 \}\),
max.depth: the maximum depth a tree could have, assumes values \(\{10, 15, 20\}\).
All the 12 possible combinations of the previous hyper-parameters are tested through 3-folds cross validation and eventually the combination which gives the lowest RMSE is selected and used to define the final model.
# leaner------------------------------------------------------------------------
rf <- lrn("regr.ranger", num.trees = 500, splitrule = "variance")
# training task ---------------------------------------------------------------
train_task <- TaskRegr$new(
id = "train",
backend = emissions[c(train, val), ] %>% select(-c(NOX, year)),
target = "bcNOX"
)
# tuning space------------------------------------------------------------------
tune_ps <- ParamSet$new(
list(
ParamDbl$new("max.depth", lower = 5, upper = 20),
ParamInt$new("mtry", lower = 2, upper = 6)
)
)
# evaluation measure------------------------------------------------------------
MeasureOriginalRootMSE = R6::R6Class("MeasureOriginalRootMSE",
inherit = mlr3::MeasureRegr,
public = list(
initialize = function() {
super$initialize(
# custom id for the measure
id = "original_root_mse",
# additional packages required to calculate this measure
packages = character(),
# properties, see below
properties = character(),
# required predict type of the learner
predict_type = "response",
# feasible range of values
range = c(0, Inf),
# minimize during tuning?
minimize = TRUE
)
}
),
private = list(
# custom scoring function operating on the prediction object
.score = function(prediction, ...) {
original_root_mse = function(truth, response) {
original_mse = mean((invBC(truth, lmb=lambda) - invBC(response, lmb=lambda))^2)
sqrt(original_mse)
}
original_root_mse(prediction$truth, prediction$response)
}
)
)
mlr3::mlr_measures$add("original_root_mse", MeasureOriginalRootMSE)
measure <- msr("original_root_mse")
# terminator--------------------------------------------------------------------
terminator <- trm("none")
# tuning instances--------------------------------------------------------------
instance <- TuningInstanceSingleCrit$new(
task = train_task,
learner = rf,
resampling = rsmp("cv", folds = 3),
measure = measure,
search_space = tune_ps,
terminator = terminator
)
# tuner-------------------------------------------------------------------------
design <- data.table(expand.grid(max.depth = seq(10, 20, by = 5), mtry = 3:6))
tuner <- tnr("design_points", design = design)
# optimize instances------------------------------------------------------------
tuner$optimize(instance)The following is the best hyper-parameters values combination and the resulting RMSE is computed in the original space. As can be seen in the previous code chunk to do that a new measure object identified as original_root_mse, which inherit mlr3:MeasureRegr, is defined and added to the available mlr3 measures. The new measure is retrieved through msr("original_root_mse") and inserted in the tuning instance object, exploited in tuning the model.
# TUNING RESULTS ===============================================================
print(instance$result)## max.depth mtry learner_param_vals x_domain original_root_mse
## 1: 20 5 <list[5]> <list[2]> 4.384769
After tuning, the metrics are again evaluated on the validation set and compared with the previous one to show the actual enhancement in performances.
rf <- lrn("regr.ranger")
param_list <- list(
splitrule = "variance",
num.trees = 500,
importance = "impurity",
mtry = NULL,
max.depth = NULL )
# retrain models on train---------------------------------------------------
# nox
param_list$mtry <- instance$result$mtry
param_list$max.depth <- instance$result$max.depth
rf$param_set$values <- param_list
rf$train(task, row_ids = train)
new_train_pred <- rf$predict(task, row_ids = train)$response %>% invBC(lmb=lambda)
new_val_pred <- rf$predict(task, row_ids = val)$response %>% invBC(lmb=lambda)
# print results ----------------------------------------------------------------
# metrics dataframe
mat_tuned_metrics <- matrix(nrow=2, ncol=2)
rownames(mat_tuned_metrics) <- c("MAE","RMSE")
colnames(mat_tuned_metrics) <- c("val", "val_after_tuning")
mat_tuned_metrics[,"val"] <- c(
mae(actual_val, val_pred),
rmse(actual_val, val_pred)
)
mat_tuned_metrics[,"val_after_tuning"] <- c(
mae(actual_val, new_val_pred),
rmse(actual_val, new_val_pred)
)
mat_tuned_metrics %>% round(digits=2)## val val_after_tuning
## MAE 2.90 2.83
## RMSE 4.49 4.41
The big_interaction_lm linear regression model and the rf random forest model with the best parameters are now trained on the combination of training and validation set, and their performance is evaluated on the test set. To compare their behavior for both models actual test values vs predicted values plots are shown in figure 6.1.
# BEST MODELS TEST PERFORMANCES=================================================
# models------------------------------------------------------------------------
# definition
lm <- lm(bcNOX ~ AT+AH+AP+TAT+AT:AH, data = emissions, subset = -test)
rf <- lrn("regr.ranger", num.trees=500, splitrule="variance", max.depth=instance$result$max.depth, mtry=instance$result$mtry)
rf$train(task, row_ids = c(train, val))
# predictions
actual_test <- emissions$NOX[test]
test_pred_lm <- predict(lm, emissions, type = "response")[test] %>% invBC(lmb=lambda)
test_pred_rf <- rf$predict(task, row_ids = test)$response %>% invBC(lmb=lambda)
# results table-----------------------------------------------------------------
mat_test <- matrix(nrow=2, ncol=2)
rownames(mat_test) <- c("MAE","RMSE")
colnames(mat_test) <- c("lm", "rf")
mat_test[,"lm"] <- c(
mae(actual_test, test_pred_lm),
rmse(actual_test, test_pred_lm)
)
mat_test[,"rf"] <- c(
mae(actual_test, test_pred_rf),
rmse(actual_test, test_pred_rf)
)
mat_test %>% round(digits=2)Looking at the evaluated test metrics and the following plots is quite clear that the random forest outclasses the linear regression model.
act_pred_df <- data.frame(
actual = actual_test,
lm_pred = test_pred_lm,
rf_pred = test_pred_rf
)
lm_point <- ggplot() +
geom_point(data = act_pred_df, aes(actual, lm_pred), alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red", alpha = 0.6, linetype = 2, size = 0.6) +
xlab("actual test response") +
ylab("lm predicted response")
rf_point <- ggplot()+
geom_point(data = act_pred_df, aes(actual, rf_pred), alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red", alpha = 0.6, linetype = 2, size=0.6)+
xlab("actual test response") +
ylab("rf predicted response")
ggarrange(lm_point, rf_point, ncol = 2, nrow = 1, align = "v")Figure 6.1: final models behavior.
Linear regression models are the simplest models one can exploit in carrying out a regression task, considering the high complexity of some problems they sometimes cannot result in having an high accuracy, but what differentiate them from other approaches is the statistical information which can be retrieved about the predicted coefficients. In contrast, non linear models such as the Random Forest, could yield better accuracy taking into account the non-linearity inherent in natural phenomema. In real PEMS design both the models are discarded in favor of more accurate and complex models which lie in the field of neural networks, following this direction a lately investigated model is the Extreme Learning Machine, whose application in this field is studied in Heysem KAYA (2019).
Heysem KAYA, Erdinç UZUN, Pınar TÜFEKCİ. 2019. “Predicting Co and Nox Emissions from Gas Turbines: Novel Data and a Benchmark Pems.” Turkish Journal of Electrical Engineering & Computer Sciences 27 (July): 4783–96. https://doi.org/10.3906/elk-1807-87.
Kinga Skalska, Stanislaw Ledakowicz, Jacek S. Miller. 2010. “Trends in Nox Abatement: A Review.” Science of the Total Environment 408 (June): 3976–89. https://doi.org/10.1016/j.scitotenv.2010.06.001.